After you have completed running the cyclone pipeline, there will be a number of useful outputs, including:
There are many directions that one can follow up with just these outputs.
Batch effect assessment: ‘batch_qc_plots.pdf’ contains 2 umap plots (one colored by and one faceted by the “pool_id” batch information from your file_metadata) as well as numerous heatmaps that either show data by batch directly, or are annotated with batch information. These plots can be useful for assessing if clusters or other features of your data heavily associate with processing batch. Depending on your experimental design, such association may be expected, but when unexpectedly present, it can be a good idea to perform some form of batch correction. cyCombine and CytoNorm are two pathways for batch correction of CyTOF data which our team has used.
Cluster Annotation: Now that your similar cells are grouped together into “clusters”, it’s time to figure out what each cluster represents. This can be a pretty manual process, but is one of the most important steps for making biological sense of the data. It lets us make sense of what differences in cluster frequencies (see below) between samples actually mean. The ‘split_umap_by_clustering.png’, ‘feature_plots.png’, and ‘plots.pdf’ outputs are all useful here. ‘split_umap_by_clustering.png’ and ‘feature_plots.png’ can be used together to determine cell type / lineage defining markers that each cluster expresses, or for a different view, the heatmaps contained within ‘plots.pdf’ show the median expression of each marker within each cluster.
There are also many other directions that your follow up analysis might take you – too may for us to attempt to build them all in to cyclone directly. This vignette will focus some of the most common follow up directions, as well as how to turn your cyclone outputs into a SingleCellExperiment object in order to interface with other tools such as the visualization package dittoSeq.
In this vignette, we will walk through a few common follow up directions. Specifically, we will start with direct cyclone outputs and then work through these follow up steps:
Here, we make use of the SingleCellExperiment object structure because it holds everything we need for our cytometry analysis: space for expression matrices, per-cell metadata, dimensionality reductions like umap, and even per-marker metadata although we won’t be making use of that in this vignette.
In addition to the fact that it can be nice simply to have all the relevant data in a single object, for saving, loading, and version management purposes, using the Biocondoctor-standard SCE format eases interfacing with any of the plethora of single-cell analysis tools which are built around the structure. dittoSeq is but one of many such tools.
The ‘assays’ slot holds matrices in the form of features (rows) by cells
(columns). They can be accessed with assay(<sce>, <assay_name>) OR a default
assay will be retrieved with assay(<sce>).
The ‘colData’ slot holds metadata for cells, and the ‘rowData’ slot holds
metadata for markers. Here, we only make use of ‘colData’ but the slots work
similarly. Their structure is a DataFrame where each rows holds data for a
col (cell) or row (marker) of the object, and columns are the different bits of
information about them. For example, we will have column in ‘colData’ holding
the ‘cluster’ assignments of each cell. For convenience, the SCE maintainers
set up this syntax <sce>$<colData_column> to pull directly from colData, which
makes these data quite easy to access!
The ‘reducedDims’ slot holds dimensionality reductions, such as UMAP, as
matrices. There are accessor functions for these as well, ?reducedDims
We can make use of helper functions included in the cyclone package for this.
Specifically, make_or_load_full_sce(). The main bits of the function are:
sce <- SingleCellExperiment(
assays = list(transformed=t(trans_exp)),
colData = DataFrame(cell_metadata[, !grepl("UMAP",colnames(cell_metadata))]),
reducedDims = list(umap=cell_metadata[, grepl("UMAP",colnames(cell_metadata))]))
See ?make_or_load_full_sce for more details.
full_sce <- make_or_load_full_sce(
# sce_file_name is not used in the first pass because we are neither
# loading nor saving, but in later passes this will load the saved version!
sce_file_name = "cytof_full.Rds",
# Make sure you update these paths to point toward where you have your own
# cyclone outputs saved!
checkpoint1 = "checkpoint_1.RData",
checkpoint8 = "checkpoint_8.RData",
load_checkpoints = TRUE
)
## [ 2023-07-14 14:06:37 ] Loading Checkpoint 1
## Loading objects:
## CHECKPOINT
## raw_exp
## trans_exp
## cell_metadata
## marker_metadata
## file_metadata
## param_list
## [ 2023-07-14 14:06:55 ] Loading Checkpoint 8
## Loading objects:
## file_metadata
## marker_metadata
## cell_metadata
## cluster_metadata
## file_by_cluster_freq
## file_by_cluster_freq_norm
## file_by_cluster_median_exp
## cluster_median_exp
## file_median_exp
## param_list
## [ 2023-07-14 14:06:58 ] No file at 'sce_file_name', Making SCE.
## [ 2023-07-14 14:07:00 ] Turning numeric cluster metadata into factors
## [ 2023-07-14 14:07:01 ] Done.
Now let’s look at the summary of the object:
full_sce
## class: SingleCellExperiment
## dim: 58 2049985
## metadata(0):
## assays(1): transformed
## rownames(58): Time length ... ID_1 ID_2
## rowData names(0):
## colnames(2049985): cell_1 cell_2 ... cell_2049984 cell_2049985
## colData names(10): file_name pool_id ... cluster cluster_6x6
## reducedDimNames(1): umap
## mainExpName: NULL
## altExpNames(0):
An early step of any cluster-based analysis is making sense of what each cluster represents. cyclone’s direct outputs serve quite well for this purpose:
‘split_umap_by_clustering.png’ and ‘feature_plots.png’ can be used together to determine cell type / lineage defining markers that each cluster expresses, or for a different view, the heatmaps contained within ‘plots.pdf’ show the median expression of each marker within each cluster.
We recommend compiling your cluster assignments in Excel or other table/spreadsheet manipulation tool, then saving them as a .csv or .tsv. (There are also tools for loading directly from .xlsx files, but we won’t cover them here.) As long as you have one column containing the cell ids, and another holding cell type names, that’s enough! But generally, it’s useful to record both ‘coarse’ and ‘fine’-level annotations.
A great structure to follow for an annotations.csv is something like:
| cluster | coarse | fine |
|---|---|---|
| 1 | CD4T | CD4T_naive |
| 2 | NK | NK_mature |
| 3 | Monocyte | Monocyte_classical |
annots <- read.csv("annotations.csv")
head(annots)
## cluster coarse fine
## 1 1 CD8T CD8T_EMRA
## 2 2 CD8T CD8T_EMRA
## 3 3 NK NK
## 4 4 NK NK
## 5 5 B plasmablast
## 6 6 ASC ASC
This process is simple because of R’s factor function! Factors are a useful
data structure in R where “levels” (potential values) of the data are
1. pre-defined and 2. ordered. With the factor function, you can pick a vector
to start with, set the levels and their order with the levels input, and then
update what any of those levels are called with the labels input. Usefully,
if any levels are given a matching label, they will be combined together. Thus,
we can make use of this single function to achieve everything we need here –
renaming from cluster_ids to annotations & combination of clusters given same
annotations!
We’ll make use of that function to create and pull in both depths of annotation by starting with the cluster metadata / colData.
head(full_sce$cluster)
## [1] "28" "35" "27" "33" "30" "4"
We’ll make a “coarse_annot” metadata / colData column for the coarse-level.
full_sce$coarse_annot <- factor(
full_sce$cluster,
levels = annots$cluster,
labels = annots$coarse
)
We’ll make a “fine_annot” metadata / colData column for the fine-level.
full_sce$fine_annot <- factor(
full_sce$cluster,
levels = annots$cluster,
labels = annots$fine
)
Ideally, you will have added any sample-specific metadata into the
sample_metadata that went into the cyclone pipeline at the start. Anything
recorded within sample_metadata will have been copied over to the
cell_metadata in the checkpoint_8.Rdata that we used for creating our SCE.
Thus, all that metadata will already be accessible!
But sometimes we receive certain metadata only later. Such data will need to be mapped and loaded in before use.
Here, we’ll go through addition of donor sex information that was missing in the
sample_metadata given to cyclone.
sex_meta <- read.csv("sex_metadata.csv")
head(sex_meta)
## ID Sex
## 1 XVIR1-HS14 F
## 2 XVIR1-HS15 F
## 3 XVIR1-HS16 M
## 4 XVIR1-HS19 M
## 5 XVIR1-HS20 F
## 6 XVIR1-HS21 M
As you can see, we have ‘ID’ and ‘sex’ in this additional metadata file.
Critically, this ‘ID’ column matches with our SCE’s ‘donor_id’ column:
head(full_sce$donor_id)
## [1] "HS11" "HS11" "HS11" "HS11" "HS11" "HS11"
Thus, we can match the ‘ID’ column to our SCE’s ‘donor_id’ column in order to map this sex information to all cells of the data set.
full_sce$sex <- sex_meta$Sex[match(
# Target order
paste0("XVIR1-", full_sce$donor_id),
# Current data order
sex_meta$ID
)]
# View a random few
rand_cells <- sample(1:ncol(full_sce), 6, replace = FALSE)
colData(full_sce)[rand_cells, c("donor_id", "sex")]
## DataFrame with 6 rows and 2 columns
## donor_id sex
## <character> <character>
## cell_2027724 HS9 M
## cell_982390 HS2 N/A
## cell_1173233 HS21 M
## cell_1909757 HS8 M
## cell_1670165 HS6 M
## cell_1751364 HS7 M
Depending on the random seed, you may notice that some samples have a sex of “N/A” because that data is still not known for the control samples. We’ll deal with this later.
We plan to create a standard ingestion process for cells’ image coordinates in a future version of the cyclone pipeline.
Until that point, you can use the procedure here to pull in your cell coordinates.
To do so, you will need a file structured similarly to the ‘sex_metadata.csv’ file used above, but with its ID column mapping to to the cell ‘id’ metadata of your SCE / column of your cell_metadata, and all necessary coordinates information in remaining columns.
In the code below, we’ll target coordinates from a file structured:
| ID | x_loc | y_loc | z_stack |
|---|---|---|---|
| 1 1 | 17 | 116 | 1 |
| 1 2 | 243 | 10 | 1 |
| 1 3 | 197 | 84 | 1 |
coord_meta <- read.csv("coordinates_metadata.csv")
index_matches <- match(
# Target order
full_sce$id,
# Current data order
coord_meta$ID
)
full_sce$x_loc <- sex_meta$x_loc[index_matches]
full_sce$y_loc <- sex_meta$y_loc[index_matches]
full_sce$z_stack <- sex_meta$z_stack[index_matches]
Afterwards, you can use dittoScatterPlot (more details in a later section) to visualize annotations, metadata, or expression by cells’ original location!
Example:
dittoScatterPlot(
full_sce,
x.by = "x_loc",
y.by = "y_loc",
color.by = "<what-to-overlay>",
# Optional selection of a single stack
cells.use = full_sce$z_stack == "<single-stack-id>"
show.others = FALSE
)
Now, we can save our SCE to be able to skip re-running a lot of this code in the
future, using cyclone’s save_sce helper function. (Alternatively, you can use
R’s save or saveRDS function directly. This save_sce function is just a
simple wrapper on top of saveRDS that automatically sets compress = TRUE for
SCEs with more than a million cells.)
save_sce(full_sce, "cytof_full.Rds")
(a color vision deficiency & novice coder friendly visualization tool)
The exact set of plots that will be useful for any given data set depends heavily upon the experimental design and upon what sample metadata is actually available. Maybe you won’t need some of these plots, maybe you will. Our goal in this section is to give a relatively comprehensive overview of plots that are commonly useful. Feel free skip around… none of the code in this section makes changes to the underlying sce object, so feel free to only run chunks that seems useful for your own data!
dittoSeq offers plenty of plotting functions that are useful for cytometry data.
Primary/Required Inputs: All these functions will take in a target ‘object’ (our down_sce for testing or full_sce when ready to make the real version), as well as one or more marker or metadata names to their ‘var’ and/or ‘group.by’, ‘color.by’, and ‘sample.by’. Those alone are minimally enough to make a plot.
Customize-ability: All of them also have plenty of quite useful tweaks and
added functionality built in as well. We’ll go through a few examples, but we’d
encourage you to check out dittoSeq’s own vignette and the functions’ own
documentation (e.g. ?dittoDimPlot) to learn more!
1. Create an SCE representing a subset of the data.
Because it can take multiple minutes to calculate and display even a single plot
from millions of cells, we recommend creating a (further) downsampled version of
your ‘full_sce’ so that visualizations can be tested out and tweaked more
quickly. Here, we’ll give ourselves and object containing just 100k cells using
cyclone’s downsample_sce helper function.
set.seed(42)
down_sce <- downsample_sce(
full_sce,
n_keep = 100000)
## [ 2023-07-14 14:07:06 ] Creating downsampled SCE
## [ 2023-07-14 14:07:06 ] Done.
down_sce
## class: SingleCellExperiment
## dim: 58 100000
## metadata(0):
## assays(1): transformed
## rownames(58): Time length ... ID_1 ID_2
## rowData names(0):
## colnames(100000): cell_1109989 cell_54425 ... cell_555638 cell_1901363
## colData names(13): file_name pool_id ... fine_annot sex
## reducedDimNames(1): umap
## mainExpName: NULL
## altExpNames(0):
2. Test and tweak plots on the reduced SCE object. (use ‘object = down_sce’)
3. Make final plots using the full SCE object. (switch to using ‘object = full_sce’)
Note: For the final version of this vignette, we will be using the ‘full_sce’. To implement this suggested workflow for your own data, replace the ‘full_sce’ in code below with ‘down_sce’ for the initial views and tweaking stage, then switch back to ‘full_sce’ when producing final figures!
Marker expression or cell metadata can be plotted on the UMAP using
dittoDimPlot()
Some use cases:
Primary inputs = object and var. (You can leave out those input names if you like, as long as you provide the ‘object’ first, and ‘var’ second.)
# CD3 Marker Expression
dittoDimPlot(object = full_sce, var = "CD3")
# Sample metadata example 1: sample groups or processing batch metric
dittoDimPlot(full_sce, "pool_id")
# Sample metadata example 2: coarse-level cell type annotations
dittoDimPlot(full_sce, "coarse_annot")
# Label the color groups & remove the legend
dittoDimPlot(full_sce, "coarse_annot",
# Add labels
do.label = TRUE,
# Remove the legend
legend.show = FALSE)
# Adjust plot order, and also the title
dittoDimPlot(full_sce, "CD16",
# Plot cells with higher expression in the front with 'order = "increasing"'
# Also try: "decreasing" or "randomize"
order = "increasing",
# Adjust the plot title
main = "CD16 Expression")
# Only highlighting certain cells with 'cells.use'
dittoDimPlot(full_sce, "cluster",
cells.use = full_sce$cluster==4)
## Faceting is VERY useful!
# Example 1: Simple recreation of the effect of the 'split_umap_by_clustering.png'
dittoDimPlot(down_sce, "cluster",
# Create faceted plots where points are 'split' into different facets by a
# metadata given to 'split.by'.
# Faceting can help make distribution differences more visible!
split.by = "cluster")
# Example 2: Batch Correction Assessment
# Ideally, every batches would have cells in all the gray regions of the umap
dittoDimPlot(full_sce, "pool_id",
split.by = "pool_id")
Cell frequencies per-sample, grouped by one or more important metadata can be
plotted using dittoFreqPlot()
Use case:
Primary inputs = object, var, sample.by, and group.by. (You can leave out those input names if you like, as long as you provide the ‘object’ first, and ‘var’ second.)
# Coarse-level frequencies per sample, between sexes
dittoFreqPlot(object = full_sce, var = "coarse_annot",
sample.by = "donor_id", group.by = "sex")
# Cluster frequencies per sample, between sexes, with fine-level annotations in
# facet labels.
dittoFreqPlot(object = full_sce,
var = paste(full_sce$fine_annot, full_sce$cluster, sep = "__"),
sample.by = "donor_id", group.by = "sex")
# Allow y-axis to shrink/stretch per data in each facet
dittoFreqPlot(full_sce, "coarse_annot", "file_name", group.by = "sex",
split.adjust = list(scale="free_y"))
# Only show certain cell types
# Here: all fine-level annotations in coarse-level "CD4T"
dittoFreqPlot(full_sce, "fine_annot", "file_name", group.by = "sex",
vars.use = unique(full_sce$fine_annot[full_sce$coarse_annot=="CD4T"]))
# **Adjust percentage normalization to a restricted universe** & only show cell
# types within that universe.
# Here: all fine-level annotations in coarse-level "CD4T" cells
dittoFreqPlot(full_sce, "fine_annot", "file_name", group.by = "sex",
# cells.use adjusts the 'universe' for percent calculation.
cells.use = full_sce$coarse_annot=="CD4T",
# vars.use limits cells types shown, but does not affect percent calculation
vars.use = unique(full_sce$fine_annot[full_sce$coarse_annot=="CD4T"]))
# Coarse-level frequencies per sample, between sexes + also between batch
# subgroup with color.by
# Also targeting just one cell type, for visibility in the same sized plot
# For all cells, you'd want to make this plot quite large!
dittoFreqPlot(full_sce, "coarse_annot", "file_name",
group.by = "pool_id",
color.by = "sex",
vars.use = c("CD8T", "NK", "B", "ASC"))
Cell metadata composition per / grouped by any other cell metadata can be
plotted with dittoBarPlot()
Some use cases:
Primary inputs = object, var, group.by. (You can leave out those input names if you like, as long as you provide the ‘object’ first, ‘var’ second, and ‘group.by’ third.)
# Sex (or any metadata) breakdown within each cluster.
dittoBarPlot(object = full_sce, var = "sex", group.by = "cluster")
# Cluster make-up per pool / processing batch
# Ideally, you would see relatively similar distributions here.
dittoBarPlot(full_sce, "cluster", "pool_id")
# factor-ized "cluster" metadata for this next example
full_sce$cluster_factor <- factor(
full_sce$cluster,
levels = 1:36
)
# Respect factor ordering using 'retain.factor.levels'
# A 'flaw' relating to updates retaining backwards compatibility, the
# developers plan to remove the need for this input in a future update. But
# as of the writing of this vignette, it is still needed.
dittoBarPlot(full_sce, "sex", group.by = "cluster_factor",
retain.factor.levels = TRUE
)
# Ignore certain samples
# Perhaps, as for the toy data set here, certain samples were only included as
# a batch control, but do not have full information and not intended for
# inclusion in down-stream biological questions.
dittoBarPlot(full_sce, "sex", group.by = "cluster",
cells.use = !full_sce$donor_id %in% c("HS2")
)
Marker1 by marker2 expression-level scatter plots with dots (cells) optionally colored by metadata or marker3 expression
Some use cases:
Primary inputs = object, x.var, y.var (You can leave out those input names if you like, as long as you provide the ‘object’ first, ‘x.var’ second, and ‘y.var’ third.)
dittoScatterPlot(object = full_sce, x.var = "CD45RA", y.var = "CCR7")
dittoScatterPlot(full_sce, "CD45RA", "CCR7",
color.var = "sex")
# Faceting (again, it's super useful!)
dittoScatterPlot(full_sce, "CD45RA", "CCR7", color.var = "cluster",
split.by = "cluster")
# Setting titles & ...
# The same 'main', 'sub', 'xlab', 'ylab' inputs can set titles across all
# dittoSeq visualization functions (only exception is dittoHeatmap, the sole
# non-ggplot function, where only 'main' can be used)
# focusing on only certain samples + cell types with cells.use in a case where perhaps the
# control samples are particularly useful for deciding on meaningful value
# cutoffs for the target markers.
dittoScatterPlot(full_sce, "CD45RA", "CCR7", color.var = "cluster",
split.by = "cluster",
cells.use = full_sce$control_sample & full_sce$coarse_annot == "CD4T")
Plots where marker expression-level per cell are plotted as violin and/or box plots on a y-axes, or ridge plots in the x-axis direction.
Some use cases:
Primary inputs = object, var, group.by (You can leave out those input names if you like, as long as you provide the ‘object’ first, ‘var’ second, and ‘group.by’ third.)
Very important additional input = plots, which sets the data representations to use. I defaults to c(“jitter”, “vlnplot”) which puts a violin in front of jittered points for the individual cells, but changing to c(“jitter”, “vlnplot”, “boxplot”) will add boxplots on top. Additionally, wrappers dittoBoxPlot, dittoRidgePlot, and dittoRidgeJitter automatically adjust the plots input default to c(“boxplot”, “jitter”), c(“ridgeplot”), and c(“ridgeplot”, “jitter”), respectively!
dittoPlot(object = full_sce, var = "CD45RA", group.by = "cluster")
# For better examples, we'll focus on clusters 1:8 with cells.use
dittoPlot(object = full_sce, var = "CD45RA", group.by = "cluster",
cells.use = full_sce$cluster %in% 1:8)
# Add a boxplot
dittoPlot(full_sce, "CD45RA", "cluster",
cells.use = full_sce$cluster %in% 1:8,
plots = c("jitter", "vlnplot", "boxplot"))
# With boxplot, but no violin plots
dittoPlot(full_sce, "CD45RA", "cluster",
cells.use = full_sce$cluster %in% 1:8,
plots = c("jitter", "boxplot"))
# Multiple genes in a single plot, by giving a set of markers to 'var'
dittoPlot(full_sce,
var = c("CD45RA", "CD27", "CCR7"),
group.by = "cluster",
cells.use = full_sce$cluster %in% 1:8,
# Facets will be used for the 'multivars'
# so we can set the faceting shape with split.ncol or split.nrow
split.ncol = 1
)
# Faceting or coloring as additional cell grouping mechanisms
dittoPlot(full_sce, "CD45RA",
group.by = "sex",
split.by = "coarse_annot")
dittoPlot(full_sce, "CD45RA",
group.by = "coarse_annot",
color.by = "sex",
legend.title = "sex")
Frequency comparison is a very common follow up in cytof data analysis that can power identification of populations of interest.
The ‘file_by_cluster_freq_norm’ object output from cyclone can be used for calculating stats, but this is only directly useful at the cluster-level. Additional manipulation is required when wanting to calculate stats at the level of coarse or fine cell annotations, which will often combine certain clusters.
For this reason, it can be easier to piggy-back off of dittoSeq’s dittoFreqPlot data collection in order to gather cell frequency numbers in a more flexible way!
We’ve put together a function to help with this exact purpose, freq_stats().
freq_stats() is a function included in the cyclone package which flexibly
generates statistical comparisons between user-requested cell and sample
groupings by making use of dittoFreqPlot()’s data collection system.
Primary:
object: the SingleCellExperiment object to target (works for Seurat objects too, but we’re not using that here). As with plotting, test with ‘down_sce’, but make sure to switch to ‘full_sce’ to make your final outputs!group.by: String. The name of a metadata within ‘object’ that holds the condition information you wish to compare between.group.1 & group.2: Single values of the ‘group.by’ metadata which name the groups to compare.sample.by: String. The name of a metadata within ‘object’ that contains values which are unique for each sample. Typically, this can be “file_name” for cyclone outputs.cell.by: String. The name of a metadata within ‘object’ that contains the cluster or cell annotation information you wish to target.Secondary:
cell.targs: String vector, optional. If targeting just a subset of the cell clusters or annotations named in the ‘cell.by’ cell annotation metadata is desired, give that set of cell targets here.cells.use: Logical vector, optional. If targeting only a certain timepoint/treatment/condition is required for your biological question, and your are using group.by for something else, give this input <object-name>$<condition-metadata-name> == '<target-condition>' OR <object-name>$<condition-metadata-name> %in% c('<target-condition1>', '<target-condition2>', '<target-condition3>')data.out: Logical. FALSE by default. Setting it to TRUE will alter the output style to give a list of 2 elements: ‘stats’ = the standard output, and ‘data’ = the collected data.frame of cell counts and percentages used for the statistical calculations.The function does 4 things:
wilcox.test function. This is a fitting statistical test for comparison of frequency data because, unlike a t-test, it does not assume the data has a normal distribution.To give a better understanding of the inputs to the function, additional details on that first step can be helpful:
The function first makes a call to dittoFreqPlot to gather and normalize cell count data. Within dittoFreqPlot, the number of cells of each ‘sample.by’ sample assigned to each distinct ‘cell.by’ value are gathered. Cells’ group information is also pulled in at this stage, and only samples whose cells are marked as ‘group.1’ and ‘group.2’ in the ‘group.by’ metadata are targeted for counting. The total number of cells for a given sample are then calculated (ignoring any ‘cell.targs’ targeting, but respecting any ‘cells.use’ subsetting) and the counts data is then normalized as percentages of all cells for the given sample. Finally, if a set of cell names was given to ‘cell.targs’, this data.frame is trimmed to only retain rows representing those ‘cell_meta’ values. At this point, the data has been collected and normalized for the stats calculation steps.
For additional details, check out the documentation with ?freq_stats.
The frequency comparisons to run should be guided by your biological questions.
Was your study designed to assess differences between 2 groups? Then you want to target the metadata holding which cells belong to those 2 groups with group.by and give the names of those groups to group.1 and group.2.
Was your study designed to assess difference between 3 groups? This function performs just pairwise comparisons, so you’ll want to run it 3 times, targeting 1vs2, 1vs3, an 2vs3 to get the full picture of statistically significant differences. You likely have the group info in a single metadata, and if so you would keep group.by the same for all runs, while adjusting group.1 and group.2 for each of the distinct comparisons.
Another factor is what cell annotation level to target. Often, you’ll want to assess all the levels you have. Here, we have two annotation levels, “coarse_annot”, and “fine_annot”, but we also have the individual clusters level as well. So that makes 3 cell-levels that we would target here.
Finally, is the question of whether you want the ‘universe’ that cell percentages are normalized within to be “all cells of the given sample” versus “all cells of a certain coarse annotation type”. For example, you might be more interested in knowing the percentage of Treg cells out of CD4T cells, rather than the percentage of Treg cells out of all cells of the given sample. This can be achieved with our function as well! Just skip down to the next section to see how!
Here we will compare across ‘sex’ which has only two groups in the data -> 1x pairwise group comparison. Plus, we have the “coarse_annot”, “fine_annot”, and “clusters” levels of cell annotation to compare between -> 3x cell type levels. That makes for 3 comparisons total.
How this plays out in inputs to the function:
group.by, group.1, group.2: In the data set for this tutorial, our metadata containing the sample grouping information is called ‘sex’, so we’ll use that for ‘group.by’. Its values are “F”, representing female, and “M”, representing Male, so we’ll use ‘group.1 = “F”’ and ‘group.2 = “M”’.sample.by: Typically, “file_name” can be used for this input for any cyclone data set unless you had multiple fcs files per sample.cell.by: Our two levels of annotations are stored in “coarse_annot” and “fine_annot” metadata columns, and we can also calculate statistics at the original “cluster” level, but we’ll still want an idea of what those cells are, so we’ll create a metadata column that’s a combination of “fine_annot” and “cluster” below, and use that!# Coarse level
freq_stats(
object = full_sce,
sample.by = "file_name",
cell.by = "coarse_annot",
group.by = "sex", group.1 = "F", group.2 = "M")
## cell_group comparison median_g1 median_g2 median_fold_change
## 1 CD8T F_vs_M 0.2168043 0.205610000 1.0544445
## 2 NK F_vs_M 0.0550800 0.066380657 0.8297598
## 3 B F_vs_M 0.0458200 0.055990000 0.8183604
## 4 ASC F_vs_M 0.0617000 0.033520670 1.8406553
## 5 gdT F_vs_M 0.0119200 0.023830236 0.5002049
## 6 pDC F_vs_M 0.0043200 0.004360044 0.9908158
## 7 CD4T F_vs_M 0.4012000 0.322700000 1.2432600
## 8 neutrophils F_vs_M 0.0059200 0.008490071 0.6972851
## 9 basophils F_vs_M 0.0025200 0.002360026 1.0677847
## 10 cMono F_vs_M 0.1216200 0.122500000 0.9928163
## 11 ncMono F_vs_M 0.0111600 0.016650165 0.6702636
## 12 cDC1 F_vs_M 0.0142200 0.017800177 0.7988684
## 13 cDC2 F_vs_M 0.0083000 0.009820000 0.8452138
## median_log2_fold_change positive_fc_means_up_in p padj
## 1 0.07648318 F 0.437486303 0.6319247
## 2 -0.26923434 F 0.140065180 0.3641695
## 3 -0.28919172 F 0.085907678 0.3641695
## 4 0.88021949 F 0.131333339 0.3641695
## 5 -0.99940902 F 0.043941702 0.2856211
## 6 -0.01331125 F 0.591614303 0.7330794
## 7 0.31412803 F 0.008419404 0.1094523
## 8 -0.52017944 F 0.676688680 0.7330794
## 9 0.09462074 F 0.984174128 0.9841741
## 10 -0.01040125 F 0.676688680 0.7330794
## 11 -0.57719943 F 0.284039717 0.4764233
## 12 -0.32397015 F 0.222025353 0.4764233
## 13 -0.24261169 F 0.293183556 0.4764233
# Fine level
freq_stats(
object = full_sce,
sample.by = "file_name",
cell.by = "fine_annot",
group.by = "sex", group.1 = "F", group.2 = "M")
## cell_group comparison median_g1 median_g2 median_fold_change
## 1 CD8T_EMRA F_vs_M 0.01870037 0.030350589 0.6161453
## 2 NK F_vs_M 0.05508000 0.066380657 0.8297598
## 3 plasmablast F_vs_M 0.01774000 0.020390224 0.8700248
## 4 ASC F_vs_M 0.06170000 0.033520670 1.8406553
## 5 CD8T_CM F_vs_M 0.01460000 0.009570191 1.5255703
## 6 CD8T_EM F_vs_M 0.07662000 0.089520000 0.8558981
## 7 gdT_CD8pos F_vs_M 0.00478000 0.010650107 0.4488218
## 8 pDC F_vs_M 0.00432000 0.004360044 0.9908158
## 9 B F_vs_M 0.00898000 0.032140313 0.2793999
## 10 CD4T_EM F_vs_M 0.13832000 0.107430000 1.2875361
## 11 neutrophils F_vs_M 0.00592000 0.008490071 0.6972851
## 12 basophils F_vs_M 0.00252000 0.002360026 1.0677847
## 13 cMono F_vs_M 0.12162000 0.122500000 0.9928163
## 14 CD8T_naive F_vs_M 0.09618000 0.059070599 1.6282212
## 15 gdT_DN F_vs_M 0.00476000 0.013380267 0.3557478
## 16 ncMono F_vs_M 0.01116000 0.016650165 0.6702636
## 17 CD4T_naive F_vs_M 0.13820276 0.105820000 1.3060174
## 18 Tregs F_vs_M 0.02338000 0.020950423 1.1159679
## 19 cDC1 F_vs_M 0.01422000 0.017800177 0.7988684
## 20 CD4T_CM F_vs_M 0.08172000 0.060960000 1.3405512
## 21 cDC2 F_vs_M 0.00830000 0.009820000 0.8452138
## median_log2_fold_change positive_fc_means_up_in p padj
## 1 -0.69865740 F 0.85828854 0.9486347
## 2 -0.26923434 F 0.14006518 0.6481450
## 3 -0.20087160 F 0.98417413 0.9841741
## 4 0.88021949 F 0.13133334 0.6481450
## 5 0.60934869 F 0.85828854 0.9486347
## 6 -0.22448901 F 0.56434676 0.9473642
## 7 -1.15578535 F 0.22202535 0.6481450
## 8 -0.01331125 F 0.59161430 0.9473642
## 9 -1.83959664 F 0.03564774 0.6481450
## 10 0.36461285 F 0.17897317 0.6481450
## 11 -0.52017944 F 0.67668868 0.9473642
## 12 0.09462074 F 0.98417413 0.9841741
## 13 -0.01040125 F 0.67668868 0.9473642
## 14 0.70329668 F 0.30864047 0.6481450
## 15 -1.49107345 F 0.16604655 0.6481450
## 16 -0.57719943 F 0.28403972 0.6481450
## 17 0.38517415 F 0.79643411 0.9486347
## 18 0.15829557 F 0.79272872 0.9486347
## 19 -0.32397015 F 0.22202535 0.6481450
## 20 0.42282630 F 0.67668868 0.9473642
## 21 -0.24261169 F 0.29318356 0.6481450
### Cluster level, but with fine annotations + cluster names to make interpretation easier
# Create combined metadata
full_sce$fine_annot__cluster <- paste0(full_sce$fine_annot, "__cluster", full_sce$cluster)
# Run stats
freq_stats(
object = full_sce,
sample.by = "file_name",
cell.by = "fine_annot__cluster",
group.by = "sex", group.1 = "F", group.2 = "M") # <-- Now using the metadata defined above
## cell_group comparison median_g1 median_g2 median_fold_change
## 1 ASC__cluster6 F_vs_M 0.061700000 0.033520670 1.8406553
## 2 B__cluster12 F_vs_M 0.008980000 0.032140313 0.2793999
## 3 basophils__cluster17 F_vs_M 0.002520000 0.002360026 1.0677847
## 4 CD4T_CM__cluster32 F_vs_M 0.033240000 0.039780000 0.8355958
## 5 CD4T_CM__cluster33 F_vs_M 0.038820000 0.022280223 1.7423524
## 6 CD4T_EM__cluster15 F_vs_M 0.015120000 0.015700156 0.9630478
## 7 CD4T_EM__cluster21 F_vs_M 0.033300000 0.039430000 0.8445346
## 8 CD4T_EM__cluster27 F_vs_M 0.036960000 0.050040968 0.7385948
## 9 CD4T_naive__cluster25 F_vs_M 0.091500000 0.054140000 1.6900628
## 10 CD4T_naive__cluster26 F_vs_M 0.041480000 0.058390607 0.7103882
## 11 CD8T_CM__cluster7 F_vs_M 0.014600000 0.009570191 1.5255703
## 12 CD8T_EM__cluster13 F_vs_M 0.033880000 0.029030000 1.1670685
## 13 CD8T_EM__cluster14 F_vs_M 0.016660000 0.024360252 0.6839010
## 14 CD8T_EM__cluster8 F_vs_M 0.007380000 0.030270296 0.2438034
## 15 CD8T_EMRA__cluster1 F_vs_M 0.006820136 0.006510000 1.0476400
## 16 CD8T_EMRA__cluster2 F_vs_M 0.013280000 0.014370153 0.9241377
## 17 CD8T_naive__cluster19 F_vs_M 0.086700000 0.046700000 1.8565310
## 18 CD8T_naive__cluster22 F_vs_M 0.009300000 0.009910097 0.9384368
## 19 cDC1__cluster30 F_vs_M 0.014220000 0.017800177 0.7988684
## 20 cDC2__cluster36 F_vs_M 0.008300000 0.009820000 0.8452138
## 21 cMono__cluster18 F_vs_M 0.020960000 0.032510317 0.6447184
## 22 cMono__cluster24 F_vs_M 0.038420000 0.058050000 0.6618432
## 23 cMono__cluster29 F_vs_M 0.005260000 0.004740000 1.1097046
## 24 cMono__cluster35 F_vs_M 0.046320000 0.031340000 1.4779834
## 25 gdT_CD8pos__cluster9 F_vs_M 0.004780000 0.010650107 0.4488218
## 26 gdT_DN__cluster20 F_vs_M 0.004760000 0.013380267 0.3557478
## 27 ncMono__cluster23 F_vs_M 0.011160000 0.016650165 0.6702636
## 28 neutrophils__cluster16 F_vs_M 0.002620000 0.005310156 0.4933941
## 29 neutrophils__cluster34 F_vs_M 0.003700000 0.003450000 1.0724638
## 30 NK__cluster10 F_vs_M 0.005600000 0.007830000 0.7151980
## 31 NK__cluster3 F_vs_M 0.009800000 0.034120323 0.2872189
## 32 NK__cluster4 F_vs_M 0.025040000 0.027980273 0.8949162
## 33 pDC__cluster11 F_vs_M 0.004320000 0.004360044 0.9908158
## 34 plasmablast__cluster5 F_vs_M 0.017740000 0.020390224 0.8700248
## 35 Tregs__cluster28 F_vs_M 0.012780000 0.013520405 0.9452379
## 36 Tregs__cluster31 F_vs_M 0.011700000 0.006600000 1.7727273
## median_log2_fold_change positive_fc_means_up_in p padj
## 1 0.88021949 F 0.13133334 0.4298182
## 2 -1.83959664 F 0.03564774 0.4298182
## 3 0.09462074 F 0.98417413 0.9841741
## 4 -0.25912289 F 0.30864047 0.5050480
## 5 0.80103647 F 0.22202535 0.4440507
## 6 -0.05432072 F 0.85828854 0.9655746
## 7 -0.24377153 F 0.95254862 0.9841741
## 8 -0.43714495 F 0.06519038 0.4298182
## 9 0.75707686 F 0.19260224 0.4440507
## 10 -0.49332041 F 0.12105748 0.4298182
## 11 0.60934869 F 0.85828854 0.9655746
## 12 0.22288930 F 0.82723337 0.9655746
## 13 -0.54814068 F 0.09384527 0.4298182
## 14 -2.03621008 F 0.07850178 0.4298182
## 15 0.06714305 F 0.98417413 0.9841741
## 16 -0.11382030 F 0.73113144 0.9400261
## 17 0.89260944 F 0.41416497 0.5734592
## 18 -0.09166846 F 0.32823925 0.5137658
## 19 -0.32397015 F 0.22202535 0.4440507
## 20 -0.24261169 F 0.29318356 0.5026695
## 21 -0.63325891 F 0.13133334 0.4298182
## 22 -0.59543855 F 0.12105748 0.4298182
## 23 0.15017574 F 0.19260224 0.4440507
## 24 0.56363007 F 0.34859862 0.5228979
## 25 -1.15578535 F 0.22202535 0.4440507
## 26 -1.49107345 F 0.16604655 0.4440507
## 27 -0.57719943 F 0.28403972 0.5026695
## 28 -1.01918749 F 0.37380340 0.5382769
## 29 0.10092891 F 0.29322390 0.5026695
## 30 -0.48358548 F 0.11139863 0.4298182
## 31 -1.79977766 F 0.05923987 0.4298182
## 32 -0.16017547 F 0.85564793 0.9655746
## 33 -0.01331125 F 0.59161430 0.7888191
## 34 -0.20087160 F 0.98417413 0.9841741
## 35 -0.08125055 F 0.20694986 0.4440507
## 36 0.82597060 F 0.08212935 0.4298182
Of course, it’s also helpful to have a visual. We’ve described dittoFreqPlot in a previous section. Here’s how we might use it for the cluster-level:
# Visualization of frequency differences at the cluster-level
dittoFreqPlot(
object = full_sce,
var = "fine_annot__cluster",
group.by = "sex",
sample.by = "file_name",
# Optionally target only group.1 and group.2 with:
# (modify "sex", "F", and "M" for your own data)
cells.use = full_sce$sex %in% c("F", "M"),
# Allow y-axis to shrink/expand per range of each cell type
split.adjust = list(scale = "free_y")
)
One might also wish to assess cell frequency in terms of “fine” annotation per “coarse” annotation. For example, you might be more interested in knowing the percentage of Treg cells out of CD4T cells, than in the percentage of Treg cells out of all cells of the given sample. Luckily, to achieve such metrics, simply subset the SCE first!
# Create a subset of our data that is only the cells labeled as CD4T in coarse_annot
full_sce_CD4 <- full_sce[, full_sce$coarse_annot=="CD4T"]
# Calculate fine-level statistics within these CD4T cells
freq_stats(
object = full_sce_CD4,
sample.by = "file_name",
cell.by = "fine_annot",
group.by = "sex", group.1 = "M", group.2 = "F",
# Explicitly targeting just the values of 'fine_annot' here that have a 'coarse_annot' of CD4T
cell.targs = c("CD4T_EM", "CD4T_naive", "Tregs", "CD4T_CM")
)
## cell_group comparison median_g1 median_g2 median_fold_change
## 1 CD4T_EM M_vs_F 0.32712653 0.34146341 0.9580134
## 2 CD4T_naive M_vs_F 0.36019454 0.38806264 0.9281866
## 3 Tregs M_vs_F 0.06723778 0.06680353 1.0065003
## 4 CD4T_CM M_vs_F 0.21324079 0.21663548 0.9843300
## median_log2_fold_change positive_fc_means_up_in p padj
## 1 -0.061882246 M 0.6194476 0.7659395
## 2 -0.107513236 M 0.7659395 0.7659395
## 3 0.009347672 M 0.3282393 0.7659395
## 4 -0.022786070 M 0.6194476 0.7659395
# And the visualization
dittoFreqPlot(
object = full_sce_CD4,
var = "fine_annot",
group.by = "sex",
sample.by = "file_name",
# Optionally target only group.1 and group.2 with:
# (modify "sex", "F", and "M" for your own data)
cells.use = full_sce_CD4$sex %in% c("F", "M"),
# vars.use is the cell.targs equivalent in the dittoFreqPlot function
vars.use = c("CD4T_EM", "CD4T_naive", "Tregs", "CD4T_CM"),
# Allow y-axis to shrink/expand per range of each cell type
split.adjust = list(scale = "free_y"),
# Update the y-axis label to mention the adjusted universe
ylab = "Percent of CD4T cells"
)